Coalescing / contiguous access
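- Coalescing is the hardware's merging of the global-memory accesses issued by the threads of one warp (a half-warp on the earliest CUDA devices) into as few memory transactions as possible. The simplest way to get it is to have consecutive threads access consecutive memory addresses.
- GPUs execute threads in fixed-size groups (warps on NVIDIA hardware, wavefronts on AMD), and the memory requests of a group are issued together.
- If the threads in a group load adjacent addresses, the hardware can merge their requests into fewer memory transactions.
- Non-sequential or strided accesses spread a group's requests over more memory segments, which increases the number of transactions and reduces effective bandwidth.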
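The effect of stride on transaction count can be illustrated with a small model. This is only a sketch, not a description of any particular GPU: the function name `transactions_per_warp` is made up for illustration, and the 32-thread group, 4-byte elements, and 128-byte aligned segments are assumed values.

```python
def transactions_per_warp(stride_elems, elem_bytes=4, warp_size=32,
                          segment_bytes=128, base_addr=0):
    """Count the distinct aligned segments touched by one group of threads.

    Thread t reads the element at index t * stride_elems. The defaults
    (32 threads, 4-byte elements, 128-byte segments) are assumptions
    chosen for illustration only.
    """
    segments = set()
    for t in range(warp_size):
        first = base_addr + t * stride_elems * elem_bytes
        last = first + elem_bytes - 1
        # An element may straddle a segment boundary if base_addr is unaligned.
        for seg in range(first // segment_bytes, last // segment_bytes + 1):
            segments.add(seg)
    return len(segments)


if __name__ == "__main__":
    for stride in (1, 2, 8, 32):
        print(f"stride {stride:2d}: {transactions_per_warp(stride)} segment(s)")
```

Under these assumptions a stride of 1 touches a single segment, while a stride of 32 elements touches 32 segments for the same 128 bytes of useful data.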
Cache lines and alignment
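- Memory is serviced at cache-line granularity, so an unaligned access or a set of small scattered loads can pull in a full line for a few useful bytes, or span several lines, and the unused bytes still consume bandwidth. Laying out buffers so that reads are aligned and contiguous reduces both the number of lines fetched and the number of cache misses.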
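To make the cost of misalignment and scattering concrete, the following sketch counts the cache lines a set of loads touches and the fraction of fetched bytes that are actually used. The helper names `lines_touched` and `efficiency` are hypothetical, the 128-byte line size is an assumption, and the model ignores caching across requests.

```python
def lines_touched(accesses, line_bytes=128):
    """Number of distinct cache lines covered by (start_byte, size) accesses."""
    lines = set()
    for start, size in accesses:
        lines.update(range(start // line_bytes,
                           (start + size - 1) // line_bytes + 1))
    return len(lines)


def efficiency(accesses, line_bytes=128):
    """Useful bytes divided by bytes fetched (assumes non-overlapping accesses)."""
    useful = sum(size for _, size in accesses)
    return useful / (lines_touched(accesses, line_bytes) * line_bytes)


if __name__ == "__main__":
    # 32 four-byte loads in three arrangements (all values are illustrative).
    aligned = [(4 * t, 4) for t in range(32)]         # bytes 0..127
    misaligned = [(4 + 4 * t, 4) for t in range(32)]  # bytes 4..131
    scattered = [(1024 * t, 4) for t in range(32)]    # one load per line
    for name, acc in (("aligned", aligned), ("misaligned", misaligned),
                      ("scattered", scattered)):
        print(name, lines_touched(acc), efficiency(acc))
```

In this model the aligned pattern uses every fetched byte, shifting it by one element doubles the lines fetched, and the scattered pattern fetches 32 lines for 128 useful bytes.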
Bank conflicts (shared memory)
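- Shared memory is divided into banks. When several threads in a group access different addresses that map to the same bank, the accesses are serialized. Layout transforms such as padding rows or transposing the access pattern can avoid these conflicts.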
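The padding trick can be checked with a simple bank model. This is a sketch under stated assumptions: 32 banks, one 4-byte word per bank per cycle, bank index equal to the word index modulo 32, and reads of the same word by several threads counted as a single access. The names `conflict_degree` and `column_read` are hypothetical.

```python
from collections import defaultdict


def conflict_degree(word_indices, num_banks=32):
    """Largest number of distinct words that fall in any one bank.

    A result of 1 means conflict-free; a result of n means the access is
    replayed roughly n times in this simplified model.
    """
    words_per_bank = defaultdict(set)
    for w in word_indices:
        words_per_bank[w % num_banks].add(w)
    return max(len(words) for words in words_per_bank.values())


def column_read(pitch, column=0, warp_size=32):
    """Word indices when thread t reads tile[t][column] with the given row pitch."""
    return [t * pitch + column for t in range(warp_size)]


if __name__ == "__main__":
    print("pitch 32:", conflict_degree(column_read(32)))  # every row hits the same bank
    print("pitch 33:", conflict_degree(column_read(33)))  # one padding word per row
```

With a row pitch of 32 words, a column read lands entirely in one bank; padding each row to 33 words shifts successive rows to successive banks, so the same read becomes conflict-free in this model.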
Textures / texture caches
- Sampled image (texture) access can go through specialized caches that assume 2D spatial locality rather than the linear locality of raw buffer loads. The memory layout of the image, in particular whether it is tiled, therefore influences cache efficiency.
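The influence of tiling can be sketched by counting the cache lines that a small 2D footprint touches under a row-major layout and under a simple tiled layout. Real texture layouts are vendor-specific and often opaque, so everything here is an assumption for illustration: the 8x4 tile shape, 32 texels per cache line, and the function names `linear_offset`, `tiled_offset`, and `lines_for_footprint`.

```python
def linear_offset(x, y, width):
    """Texel index in a plain row-major image."""
    return y * width + x


def tiled_offset(x, y, width, tile_w=8, tile_h=4):
    """Texel index when the image is stored tile by tile.

    Assumes width is a multiple of tile_w; the 8x4 tile is illustrative.
    """
    tiles_per_row = width // tile_w
    tile_index = (y // tile_h) * tiles_per_row + (x // tile_w)
    within_tile = (y % tile_h) * tile_w + (x % tile_w)
    return tile_index * tile_w * tile_h + within_tile


def lines_for_footprint(offset_fn, x0, y0, w, h, width, texels_per_line=32):
    """Distinct cache lines touched by a w-by-h block of texels at (x0, y0)."""
    return len({offset_fn(x, y, width) // texels_per_line
                for y in range(y0, y0 + h)
                for x in range(x0, x0 + w)})


if __name__ == "__main__":
    width = 256
    print("row-major:", lines_for_footprint(linear_offset, 0, 0, 8, 8, width))
    print("tiled:    ", lines_for_footprint(tiled_offset, 0, 0, 8, 8, width))
```

For an aligned 8x8 footprint in a 256-texel-wide image, the row-major layout touches one line per row, while the tiled layout touches one line per tile, which is why 2D-local sampling tends to benefit from tiled storage.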